26 July, 2019
Order of R functions covered:

- lm()
- summary()
- ggplot() + stat_qq()
- ggplot() + geom_smooth()
- tidy(), augment(), glance()
- summ(), plot_summs()
- autoplot()
- glm() (different families)
- lmer()

In a linear regression, we aim to find a model:
lm(y ~ x, data = data).
This is called the formula notation in R.
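As a minimal sketch of the formula notation (using the built-in mtcars data as a stand-in for the example data set):

```r
# Formula notation: the outcome goes left of `~`, the predictor to the right
# (mtcars stands in for the example data here)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # intercept and slope of the fitted line
```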
We can also use the pipe operator %>% in the following way:

data %>%
  lm(y ~ x, data = .)

Let's first have a look at the summary table of the example data set, by using the summary() command:
##       day1             day2               day3      
##  Min.   :-1.611   Min.   :  0.2396   Min.   : 0.00  
##  1st Qu.: 3.031   1st Qu.:  9.9763   1st Qu.:13.25  
##  Median : 5.619   Median : 15.9973   Median :19.37  
##  Mean   : 5.428   Mean   : 16.7252   Mean   :20.27  
##  3rd Qu.: 7.772   3rd Qu.: 20.6984   3rd Qu.:28.05  
##  Max.   :12.142   Max.   :143.8187   Max.   :39.03  
##  NA's   :4        NA's   :16
Before we start with the linear regression model, we need to get an idea of the underlying data and its distribution. We know that linear regression comes with the assumptions of linearity, independent observations, homoscedasticity (constant variance of the residuals), and normally distributed residuals.
tki_demo %>% gather(Days, measurement, day1:day3, factor_key = TRUE) %>% ggplot( aes(sample = measurement, color=Days)) + stat_qq() + facet_wrap(~Days)
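For reference, base R offers the same normality check without ggplot2; a Shapiro-Wilk test adds a numeric companion to the visual inspection (mtcars is used here as stand-in data):

```r
# Base-R counterpart to the ggplot Q-Q plot above
qqnorm(mtcars$mpg)   # points should fall on a straight line if normal
qqline(mtcars$mpg)   # reference line

# Shapiro-Wilk test: a numeric check of the normality assumption
sw <- shapiro.test(mtcars$mpg)
sw$p.value
```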
The lm() function

In the plots, we could already see that day2 and day3 seem to be associated. This becomes obvious when drawing a line through the points. Let's now fit a linear regression model in R.
lm(day2 ~ day3, data = tki_demo)
As said before, the left-hand side of the formula is y, our outcome variable or dependent variable. In this case it is day2.
The right-hand side is x, the independent variable. In our case: day3.
We also specify the data set that holds the variables we specified as x and y.
Now we want to look at the results of the linear regression. So how do we get the p-value and \(\beta\)-coefficient for the association?
In R, we add the summary() function to the lm() function, like so:
summary(lm(y ~ x, data = data))
We can also store our model results in a variable: model1 <- lm(y ~ x, data = data), and then use summary: summary(model1)
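The beta coefficient and p-value can also be pulled out programmatically from the summary's coefficient matrix; a small sketch with mtcars as stand-in data:

```r
# Store the model, then extract beta and p-value from the coefficient matrix
model1 <- lm(mpg ~ wt, data = mtcars)
ctab <- coef(summary(model1))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
beta <- ctab["wt", "Estimate"]
pval <- ctab["wt", "Pr(>|t|)"]
```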
## 
## Call:
## lm(formula = day1 ~ day3, data = tki_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1940 -2.2388  0.3604  2.2765  7.2805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.52452    0.88699   5.101 2.14e-06 ***
## day3         0.03819    0.04016   0.951    0.345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.225 on 82 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.0109, Adjusted R-squared:  -0.001158 
## F-statistic: 0.904 on 1 and 82 DF,  p-value: 0.3445
Tidying lm() results with jtools and broom

The broom R package is in line with the "tidy" data handling in R and turns the linear model results into an easily accessible tibble format: tidy(lm1)
## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)   4.52      0.887      5.10  0.00000214
## 2 day3          0.0382    0.0402     0.951 0.345
The broom package helps with the accessibility of the output, but the style of the output is not very appealing for a publication or a report. The jtools package helps with this and has other nice functionalities, such as forest plots for coefficients and confidence intervals: export_summs(lm1)
|             | Model 1  |
|-------------|----------|
| (Intercept) | 4.52 *** |
|             | (0.89)   |
| day3        | 0.04     |
|             | (0.04)   |
| N           | 84       |
| R2          | 0.01     |

*** p < 0.001; ** p < 0.01; * p < 0.05.
model <- lm(day2 ~ day3, data = tki_demo)
head(augment(model))
| .rownames | day2 | day3 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 19.4 | 29.6 | 22.1 | 2.52 | -2.7 | 0.0273 | 15.4 | 0.000451 | -0.179 |
| 3 | 22.8 | 39 | 27 | 4.05 | -4.2 | 0.0703 | 15.4 | 0.00307 | -0.285 |
| 4 | 4.61 | 9.41 | 11.4 | 2.68 | -6.84 | 0.0307 | 15.4 | 0.00327 | -0.454 |
| 5 | 19.9 | 14.7 | 14.2 | 1.99 | 5.63 | 0.017 | 15.4 | 0.0012 | 0.372 |
| 7 | 20.3 | 24.6 | 19.4 | 1.92 | 0.823 | 0.0158 | 15.4 | 2.37e-05 | 0.0543 |
| 9 | 13.7 | 9.47 | 11.5 | 2.67 | 2.23 | 0.0305 | 15.4 | 0.000346 | 0.148 |
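The columns that augment() adds are also available from a fitted lm object in base R; a sketch of the correspondence, with mtcars as stand-in data:

```r
# Rebuild the key augment() columns with base-R extractor functions
model <- lm(mpg ~ wt, data = mtcars)

aug <- data.frame(
  mpg     = mtcars$mpg,
  wt      = mtcars$wt,
  .fitted = fitted(model),         # corresponds to augment()'s .fitted
  .resid  = residuals(model),      # corresponds to .resid
  .hat    = hatvalues(model),      # leverage, corresponds to .hat
  .cooksd = cooks.distance(model)  # corresponds to .cooksd
)
head(aug)
```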
autoplot(model)
Multiple linear regression works with the same function in R: lm(y ~ x + covar1 + covar2 + … + covarx, data = data)
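A minimal sketch of this formula with two covariates (mtcars stands in for the demo data; wt and hp are the covariates here):

```r
# Multiple linear regression: additional covariates are joined with `+`
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)  # one coefficient per term plus the intercept
```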
export_summs(lm(day3 ~ day2 + male, data = tki_demo))
|             | Model 1   |
|-------------|-----------|
| (Intercept) | 15.24 *** |
|             | (1.44)    |
| day2        | 0.15 *    |
|             | (0.06)    |
| maleTRUE    | 6.47 ***  |
|             | (1.88)    |
| N           | 80        |
| R2          | 0.21      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
With the demo data set:
export_summs(lm(day3 ~ day2 + male + smoker, data = tki_demo))
|             | Model 1   |
|-------------|-----------|
| (Intercept) | 15.81 *** |
|             | (1.46)    |
| day2        | 0.16 **   |
|             | (0.06)    |
| maleTRUE    | 7.11 ***  |
|             | (1.89)    |
| smokerTRUE  | -3.67     |
|             | (2.03)    |
| N           | 80        |
| R2          | 0.24      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
With the demo data set:
lm1 <- lm(day3 ~ day2, data = tki_demo)
lm2 <- lm(day3 ~ day2 + male + smoker, data = tki_demo)
export_summs(lm1, lm2)
|             | Model 1   | Model 2   |
|-------------|-----------|-----------|
| (Intercept) | 17.25 *** | 15.81 *** |
|             | (1.41)    | (1.46)    |
| day2        | 0.16 **   | 0.16 **   |
|             | (0.06)    | (0.06)    |
| maleTRUE    |           | 7.11 ***  |
|             |           | (1.89)    |
| smokerTRUE  |           | -3.67     |
|             |           | (2.03)    |
| N           | 80        | 80        |
| R2          | 0.09      | 0.24      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
The jtools package has a nice function, plot_summs(), to do this very easily, utilising ggplot2: plot_summs(lm1)
plot_summs(lm1, lm2)
plot_summs(lm1, lm2, coefs = "day2")
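The numbers behind such a forest plot are simply the point estimates and confidence intervals; they can be reproduced with base R's coef() and confint() (mtcars as stand-in data):

```r
# Point estimates with 95% confidence intervals, as drawn by plot_summs()
lm1 <- lm(mpg ~ wt, data = mtcars)
ci <- cbind(Est. = coef(lm1), confint(lm1, level = 0.95))
ci
```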
With jtools we can access more information from the model in one easy step:

summ(lm2, scale = TRUE, vifs = TRUE, part.corr = TRUE,
     confint = TRUE, pvals = FALSE)$coeftable
##                   Est.       2.5%      97.5%    t val.      VIF  partial.r
## (Intercept) 18.569528 16.2225902 20.9164657 15.758586       NA         NA
## day2         2.568780  0.7826129  4.3549470  2.864328 1.021056  0.3121443
## maleTRUE     7.107081  3.3480942 10.8660688  3.765636 1.041817  0.3965366
## smokerTRUE  -3.665084 -7.7050377  0.3748704 -1.806864 1.054610 -0.2029483
##                 part.r
## (Intercept)         NA
## day2         0.2862919
## maleTRUE     0.3763784
## smokerTRUE  -0.1805975
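The VIF column reported by summ() can be computed by hand: the variance inflation factor for a predictor is 1 / (1 - R²) of that predictor regressed on all other predictors. A sketch with mtcars as stand-in data:

```r
# VIF for wt in a model with predictors wt, hp and disp:
# regress wt on the other predictors, then apply 1 / (1 - R^2)
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # always > 1; values around 5-10 are often read as collinearity
```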
Logistic and Poisson regression do not work with the lm() function; we use the glm() function instead. The lm() function always assumes a normal (Gaussian) distribution, whereas in the glm() function we have to assign the distribution family and its link ourselves:

- logistic regression: family = binomial(link = "logit")
- Poisson regression: family = poisson(link = "log")

export_summs(glm(smoker ~ male, data = tki_demo, family = binomial(link = 'logit')))
|             | Model 1   |
|-------------|-----------|
| (Intercept) | -1.20 *** |
|             | (0.29)    |
| maleTRUE    | 0.68      |
|             | (0.46)    |
| N           | 100       |
| AIC         | 120.41    |
| BIC         | 125.62    |
| Pseudo R2   | 0.03      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
export_summs(glm(smoker ~ male, data = tki_demo, family = poisson(link = 'log')))
|             | Model 1   |
|-------------|-----------|
| (Intercept) | -1.47 *** |
|             | (0.26)    |
| maleTRUE    | 0.48      |
|             | (0.38)    |
| N           | 100       |
| AIC         | 129.74    |
| BIC         | 134.95    |
| Pseudo R2   | 0.02      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
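Since glm() is part of base R's stats package, the same pattern can be tried directly; a sketch of a logistic model with mtcars as stand-in data (am is the binary transmission indicator):

```r
# Logistic regression with glm(): binary outcome, logit link
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exponentiate for odds ratios
exp(coef(fit))
```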
Mixed models with the lme4 package

When we also want to account for random effects or clusters, such as an intervention whose effect is not what we are interested in, or study center in the case of a multi-center study, we use the lme4 package.
The formula works in the following way:
- lmer(y ~ x + (1 + x | randomEffect), data = data), for a random intercept and slope per level of randomEffect
- lmer(y ~ x + (1 + x | randomEffect) + (1 | otherVariable), data = data), additionally including a random intercept for otherVariable

library(lme4)
lmer(day1 ~ day2 + (1 + day2 | intervention), data = tki_demo)
## Linear mixed model fit by REML ['lmerMod']
## Formula: day1 ~ day2 + (1 + day2 | intervention)
##    Data: tki_demo
## REML criterion at convergence: 496.8176
## Random effects:
##  Groups       Name        Std.Dev.  Corr 
##  intervention (Intercept) 1.278e-04      
##               day2        5.528e-06 -1.00
##  Residual                 3.147e+00      
## Number of obs: 96, groups:  intervention, 3
## Fixed Effects:
## (Intercept)         day2  
##     4.79533      0.03436  
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
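If lme4 is installed, the same random-intercept-and-slope structure can be tried on the package's bundled sleepstudy data (an illustration, not the tki_demo model):

```r
# Reaction time on Days, with intercept and slope varying by Subject
library(lme4)
fit <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
fixef(fit)  # fixed-effect intercept and slope
```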
Summary of functions covered:

- t.test(datx, daty)
- wilcox.test(datx, daty)
- chisq.test(group1, group2)
- lm()
- glm() with family and link
- lmer()
- broom and jtools packages: tidy(), augment(), glance()
- jtools package: export_summs() for regression tables, plot_summs() for forest plots